Practical 1 aims at analyzing water levels of the longest (i.e. 295km) Swiss river (rises in the Bernese Alps and ends at the junction with the Rhine) at a measuring station in Untersiggenthal next to the Paul Scherrer Institute where nuclear reactors are located, making this location very sensitive in case of flood damage.
The next table displays the first six rows of the niveau data set.
The data set collects the Aare’s daily maximum water levels in one unique station in Stilli, Untersiggenthal (canton of Aargau) and records the exact times at which daily maxima are detected.
| Stationsname | Stationsnummer | Parameter | Zeitreihe | Parametereinheit | Gewässer | Zeitstempel | Zeitpunkt_des_Auftretens | Wert | Freigabestatus |
|---|---|---|---|---|---|---|---|---|---|
| Untersiggenthal, Stilli | 2205 | Pegel | Tagesmaxima | m ü.M. | Aare | 2000-01-01 00:00:00 | 2000-01-01 00:23:00 | 326.245 | Freigegeben, validierte Daten |
| Untersiggenthal, Stilli | 2205 | Pegel | Tagesmaxima | m ü.M. | Aare | 2000-01-02 00:00:00 | 2000-01-02 00:43:10 | 326.153 | Freigegeben, validierte Daten |
| Untersiggenthal, Stilli | 2205 | Pegel | Tagesmaxima | m ü.M. | Aare | 2000-01-03 00:00:00 | 2000-01-03 00:00:00 | 326.053 | Freigegeben, validierte Daten |
| Untersiggenthal, Stilli | 2205 | Pegel | Tagesmaxima | m ü.M. | Aare | 2000-01-04 00:00:00 | 2000-01-04 01:43:40 | 325.871 | Freigegeben, validierte Daten |
| Untersiggenthal, Stilli | 2205 | Pegel | Tagesmaxima | m ü.M. | Aare | 2000-01-05 00:00:00 | 2000-01-05 21:23:00 | 325.837 | Freigegeben, validierte Daten |
| Untersiggenthal, Stilli | 2205 | Pegel | Tagesmaxima | m ü.M. | Aare | 2000-01-06 00:00:00 | 2000-01-06 03:33:05 | 325.835 | Freigegeben, validierte Daten |
The below plot shows the evolution of daily water maxima over 21 years from 01-01-2000 to 01-08-2021.
We see that the lowest troughs are quite constant over the years (~325.5 meter above sea level) whereas the highest peaks really distinguish themselves among the overall peaks. [SHOULD WE THEN REMOVE THE WEIRD THREE VALUES ?] Indeed, we observe that peaks are mainly observed in the Summer months with the highest peaks recorded on 25-08-2005 at 328.827 (m.a.s.l), on 09-06-2007 at 329.323 (m.a.s.l) and on 14-07-2021 at 328.622 (m.a.s.l). Therefore, over 21 years three years have had extremely and unusually high water levels during the summer, respectively 2005, 2007 and 2021.
We also observe a pattern of the maxima better shown by the green smoothed curve which indicates that water levels fluctuate according to a specific logic within a year (i.e. seasonality?). [How can we assess an homogeneity issue because of a trend ?] -> if seasonality we need a moving threshold and have two separate modeling ??????
The following output shows the five-number statistics summary of the water level values.
The summary shows that the water level values are not symmetrical around the mean nor around the median but much more tightly grouped on the left hand side of the mean (and the median) than on its right hand side (i.e. more dispersed), indicating that the water level maxima’s distribution is right-skewed. The second observation is that the difference between the minimum and maximum water level is not very large.[INDICATING WHAT ?]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 325.4 325.6 325.8 325.9 326.1 329.3
The below histogram (i.e. frequency distribution) confirms the above observations. Indeed, we see that the data is right-skewed (i.e. therefore non-normal) indicating that the data is disproportionately distributed on the right where daily maxima happen to have much higher values than usual (i.e. outliers) that need to be investigated.
The blue curve represents the smoothed version of the histogram which reflects more accurately the probability density function of the values. The dark green vertical line draws the median water level value, 325.8 meter above sea, and the light green vertical line draws the average, 325.9. [ADD THE MODE]
Looking at the positions on the x-axis of the mean and the median [and the mode] with regard to the distribution, we see that the mean seems to be a better indicator of the center of the distribution, eventhough it is not centered around it.
The objective behind the EVA is to find valid estimates of the frequency (expressed as a return period) of extreme water level maxima. -> return level is the output
[Add assumptions about observed extreme values from the theory]
The Peak-Over-Threshold method provides a model for independent exceedances (i.e. extreme values) above a large threshold. Therefore, it identifies values that are high enough to be above a designated threshold \(\mu\), above which we consider are the extremes (i.e. highest values). In order to determine an optimal threshold, we apply the MRL-plot and then look at the distribution of the exceedances. The value of \(u\) from which the plot becomes approximately linear can generally be selected as the optimal threshold.
Therefore, to model the high water levels, we first proceed to an MRL-plot to choose the optimal threshold \(u\) and then, we use the Peak-over-Threshold method and look at the distribution of the exceedances to assess their probability of occurrence.
Clusters of the extremes correspond to the clustering of the data points that are above the chosen threshold \(u\).
To measure and deal with clustering of the daily water levels, we consider consecutive threshold exceedances to belong to the same cluster. Therefore, by using the POT approach, we can observe on the Peak-Over-Threshold plot (below) the different clusters of the extreme values. Then, we can fit the Generalized Pareto Distribution (GPD) or Point Process model to the cluster maxima (after declustering if the exceedances exhibit autocorrrelation).
First, we start by selecting an appropriate threshold \(u\) to use for fitting the GPD or Point Process models. To do so, we use the mrlplot() function from the extRemes package which plots the potential thresholds against the mean excess.
The mean excess function \(e(u) = E(X - u| X > u)\) is the expected value of the excess (i.e. difference between the exceedance and the high threshold (i.e. \(u\))), knowing that the exceedance is greater than the threshold.
We are interested in finding the distribution of the daily water level maxima that are above the threshold \(u\) (i.e. exceedances), which is the excess distribution function \(F_u(x) = P(X - u \leq x | X > u)\).
The difficulty of choosing an adequate threshold \(u\) lies in the fact that it requires a good balance between selecting a threshold low enough so that we have enough exceedances to obtain useful results (i.e. results precision) but high enough so that the GPD and Point Process methods are approximately valid (i.e. bias).
[Notes for ourselves in case of a question: “If we take the threshold to be so low that most of the data are exceedances then the analysis will only be valid if the raw data came from a generalized Pareto distribution (which will rarely be the case) = 80% of consequances (i.e. exceedances) come from 20% of the causes”]
The below Mean Residual Life Plot indicates for each potential threshold \(u\) its mean excess. As mentioned previously, the value of \(u\) from which the plot becomes approximately linear can generally be selected as the optimal threshold. The dashed curves represent the confidence intervals of the mean excesses which help assess its type (i.e. constant, linear). Therefore, we decide to choose a threshold \(u = 327\ [m.a.s.l]\) form which the curve seems to become linear.
From now on, daily water level maxima above \(327\ [m.a.s.l]\) are considered extreme values.
Therefore, below \(327\ [m.a.s.l]\), a threshold would be too low for the assumptions of the extreme value approach to be valid and above \(327\ [m.a.s.l]\) would be too high to obtain insghtful results because of too few data.
Considering that the mean excess function is used to validate a GPD model for the excess distribution, we see that the mean excess function would roughly be a straight line if we could zoom out on the y-axis, we can say that the GPD is a good fit for modelling the exceedances. [ASK PROFESSOR IF WE CAN SAY THAT IT IS ROUGHLY A STRAIGHT LINE ] distribution.
The below plot shows the results of the Peak-Over-Treshold method. The red line represents the designated appropriate threshold \(u = 327\ [m.a.s.l]\) and the red dots are the extreme daily water maxima recorded and considered as exceedances.
[Try theme_economist for the plot]
The POT approach states that the observations above the high threshold \(u\) (i.e. red observations above 327 m.a.s.l) are in absolute terms not independent but it states that if they are not consecutive above \(u\), they belong to different clusters and therefore are considered independent only if they are at least a certain number of consecutive observations between them.[ASK IF CORRECT ON THRUSDAY]
To avoid dependence between exceedances, we proceed to decluster the observations above the threshold \(u\). To do so, the declustering keeps only the highest water level maxima of one cluster as the key representative observation of the cluster. [VERIFY WITH THE PROFESSOR]
The following plot shows that some exceedances above \(327 m.a.s.l\) become more transparent and some others remain more visible. The more visible observations represent the 79 observations (one observation per cluster declustered) obtained after the declustering.
[TRY WITH ANOTHER THEM]
##
## niveau$Wert declustered via runs declustering.
##
## Estimated extremal index (intervals estimate) = 0.6758134
##
## Number of clusters = 79
##
## Run length = 1
[This plot should look like the one on slide 6 from module 3]
## integer(0)
The GPD is used to describe statistical properties of excesses. Indeed, Extreme Value theory suggests that GPD is a natural approximation of the excess distribution function \(F_u(x)\) where \(x\) is the excess, so \(F_u(x) \approx G_{\xi, \beta}(x)\). [maybe add parameters conditions]
Under the assumption that exceedances are independent (i.e. probability of occurrence of one exceedance does not affect the probability of another), identically distributed (i.e. are random variables) and their distribution is asymptotic (i.e. ?), we fit the GDP model on the daily water level exceedances.
#### Return Levels (50- and 100-year events)